{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# An introduction to solving biological problems with Python\n",
    "\n",
    "## Session 2.4: BioPython\n",
    "\n",
    "- [Working with sequences](#Working-with-sequences)\n",
    "- [Connecting with biological databases](#Connecting-with-biological-databases)\n",
    "- [Exercises 2.4.1](#Exercises-2.4.1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using third party library, BioPython\n",
    "\n",
    "Biopython tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html\n",
    "\n",
    "The goal of Biopython is to make it as easy as possible to use Python for bioinformatics by creating high-quality, reusable modules and classes. Biopython features include parsers for various Bioinformatics file formats (BLAST, Clustalw, FASTA, Genbank,...), access to online services (NCBI, Expasy,...), interfaces to common and not-so-common programs (Clustalw, DSSP, MSMS...), a standard sequence class, various clustering modules, a KD tree data structure etc. and even documentation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Working with sequences"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can create a sequence by defining a `Seq` object with strings. `Bio.Seq()` takes as input a string and converts in into a Seq object. We can print the sequences, individual residues, lengths and use other functions to get summary statistics.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Creating sequence\n",
    "from Bio.Seq import Seq\n",
    "my_seq = Seq(\"AGTACACTGGT\")\n",
    "print(my_seq)\n",
    "print(my_seq[10])\n",
    "print(my_seq[1:5])\n",
    "print(len(my_seq))\n",
    "print(my_seq.count( \"A\" ))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use functions from `Bio.SeqUtils` to get idea about a sequence "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Calculate the molecular weight\n",
    "from Bio.SeqUtils import GC, molecular_weight\n",
    "print(GC( my_seq ))\n",
    "print(molecular_weight( my_seq ))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One letter code protein sequences can be converted into three letter codes using `seq3` utility "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from Bio.SeqUtils import seq3\n",
    "print(seq3( my_seq ))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Alphabets defines how the strings are going to be treated as sequence object. `Bio.Alphabet` module defines the available alphabets for Biopython. `Bio.Alphabet.IUPAC` provides basic definition for DNA, RNA and proteins. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from Bio.Alphabet import IUPAC\n",
    "my_dna = Seq(\"AGTACATGACTGGTTTAG\", IUPAC.unambiguous_dna)\n",
    "print(my_dna)\n",
    "print(my_dna.alphabet)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "my_dna.complement()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "my_dna.reverse_complement()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "my_dna.translate()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Parsing sequence file format: FASTA files"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sequence files can be parsed and read the same way we read other files. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "with open( \"data/glpa.fa\" ) as fileObj:\n",
    "    print(fileObj.read())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Biopython provides specific functions to allow parsing/reading sequence files. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Reading FASTA files\n",
    "from Bio import SeqIO\n",
    "\n",
    "fileObj = open(\"data/glpa.fa\")\n",
    "\n",
    "for protein in SeqIO.parse(fileObj, 'fasta'):\n",
    "    print(protein.id)\n",
    "    print(protein.seq)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sequence objects can be written into files using file handles with the function `SeqIO.write()`. We need to provide the name of the output sequence file and the sequence file format. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Writing FASTA files\n",
    "from Bio.SeqRecord import SeqRecord\n",
    "from Bio.Seq import Seq\n",
    "from Bio.Alphabet import IUPAC\n",
    "\n",
    "sequence = 'MYGKIIFVLLLSEIVSISASSTTGVAMHTSTSSSVTKSYISSQTNDTHKRDTYAATPRAHEVSEISVRTVYPPEEETGERVQLAHHFSEPEITLIIFG'\n",
    "\n",
    "fileObj = open( \"mySeqFile.fa\", \"w\")\n",
    "  \n",
    "seqObj = Seq(sequence, IUPAC.protein)\n",
    "proteinObjs = [SeqRecord(seqObj, id=\"MYID\", description='my description'),]\n",
    "\n",
    "SeqIO.write(proteinObjs, fileObj,  'fasta')\n",
    "\n",
    "fileObj.close()\n",
    "\n",
    "with open( \"biopython.fa\" ) as fileObj:\n",
    "    print(fileObj.read())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Connecting with biological databases"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sequences can be searched and downloaded from public databases. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Read FASTA file from NCBI GenBank\n",
    "from Bio import Entrez\n",
    "\n",
    "Entrez.email = 'A.N.Other@example.com'\n",
    "socketObj = Entrez.efetch(db=\"protein\", rettype=\"fasta\", id=\"71066805\")\n",
    "dnaObj = SeqIO.read(socketObj, \"fasta\")\n",
    "socketObj.close()\n",
    "\n",
    "print(dnaObj.description)\n",
    "print(dnaObj.seq)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Read SWISSPROT record\n",
    "from Bio import ExPASy\n",
    "\n",
    "socketObj = ExPASy.get_sprot_raw('HBB_HUMAN')\n",
    "proteinObj = SeqIO.read(socketObj, \"swiss\")\n",
    "socketObj.close()\n",
    "\n",
    "print(proteinObj.description)\n",
    "print(proteinObj.seq)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercises 2.4.1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "- Retrieve a FASTA file named `data/sample.fa` and answer the following questions:\n",
    "  - How many sequences are in the file?\n",
    "  - What are the IDs and the lengths of the longest and the shortest sequences?\n",
    "  - Create a new object that contains only sequences with length longer than 500bp. What is the average length of these sequences?\n",
    "  - Calculate and print the percentage of GC contents in each of the sequences.\n",
    "  - Write the newly created sequence object into a FASTA file named `sample.long.fa` "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Congratulation! You reached the end of day 2! "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
